Brazilian E-Commerce Public Database by Olist https://www.kaggle.com/olistbr/brazilian-ecommerce?select=olist_order_items_dataset.csv
A database foi provida pela Olist, uma empresa de marketplace situada no Brasil. A Olist conecta pequenos negócios por todo o Brasil de uma forma simples. Vendedores conseguem vender seus produtos diretamente pela Olist Store, as entregas são feitas por meio de parceiros.
Quais são os principais fatores que alavancam o número de vendas? Quais os fatores que impactam na avaliação do cliente?
Esse é o segundo notebook, onde iremos realizar clusterização e analisar os mesmos afim de promover insights. No primeiro notebook com o prefixo '01', realizamos toda a visualização do conjunto, onde foram retirados alguns insights que serão utilizados ao longo das proximas etapas.
import pandas as pd
import numpy as np
import re
import plotly.graph_objects as go
from warnings import simplefilter
from functools import reduce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from math import ceil
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from dash import dash_table
from IPython.core.display import HTML, display
# Versões dos pacotes usados neste jupyter notebook
%reload_ext watermark
%watermark -a "Herikc Brecher e João Hutner" --iversions
Author: Herikc Brecher e João Hutner dash : 2.0.0 re : 2.2.1 plotly: 5.1.0 pandas: 1.2.4 numpy : 1.19.5
simplefilter(action='ignore', category=FutureWarning)
# Seed de Aleatoriedade
seed_ = 194
np.random.seed(seed_)
# Carregamento de todos datasets
dtCustomers = pd.read_csv('../data/olist_customers_dataset.csv', encoding = 'utf8', dtype={'customer_zip_code_prefix': str})
dtOrderItems = pd.read_csv('../data/olist_order_items_dataset.csv', encoding = 'utf8')
dtOrderPayments = pd.read_csv('../data/olist_order_payments_dataset.csv', encoding = 'utf8')
dtOrderReviews = pd.read_csv('../data/olist_order_reviews_dataset.csv', encoding = 'utf8')
dtOrders = pd.read_csv('../data/olist_orders_dataset.csv', encoding = 'utf8')
Iremos adicionar uma variavel extra para todos os datasets que contém o Estado. A variavel adicionada é a Região referente ao Estado.
regioes = {
'AC': 'Norte',
'AL': 'Nordeste',
'AP': 'Norte',
'AM': 'Norte',
'BA': 'Nordeste',
'CE': 'Nordeste',
'DF': 'CentroOeste',
'ES': 'Sudeste',
'GO': 'CentroOeste',
'MA': 'Nordeste',
'MT': 'CentroOeste',
'MS': 'CentroOeste',
'MG': 'Sudeste',
'PA': 'Norte',
'PB': 'Nordeste',
'PR': 'Sul',
'PE': 'Nordeste',
'PI': 'Nordeste',
'RJ': 'Sudeste',
'RN': 'Nordeste',
'RS': 'Sul',
'RO': 'Norte',
'RR': 'Norte',
'SC': 'Sul',
'SP': 'Sudeste',
'SE': 'Nordeste',
'TO': 'Norte'
}
dtCustomers['regiao'] = dtCustomers['customer_state'].copy()
dtCustomers = dtCustomers.replace({'regiao': regioes})
dtCustomers.head()
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
Devido ao problema em questão ser dividido em 8 datasets diferentes, iremos construir datasets mais unificados afim de facilitar o processo de tratamento dos dados e clusterização.
dtCustomers.head()
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
dtOrderItems.head()
| order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value | |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
dtOrderPayments.head()
| order_id | payment_sequential | payment_type | payment_installments | payment_value | |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
dtOrderReviews.head()
| review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | NaN | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | NaN | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |
dtOrders.head()
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
listDtMergeOrders = [dtOrderItems[['order_id', 'freight_value']], dtOrderPayments, dtOrderReviews, dtOrders,\
dtOrderItems.groupby('order_id').agg({'order_item_id': max}).reset_index()]
dtGeneralOrders = reduce(lambda left, right: pd.merge(left, right, on = 'order_id'), listDtMergeOrders)
dtGeneralOrders = pd.merge(dtGeneralOrders, dtCustomers[['customer_id', 'customer_city', 'customer_state', 'regiao']],\
on = 'customer_id')
dtGeneralOrders.head()
| order_id | freight_value | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | ... | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | order_item_id | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | credit_card | 2 | 72.19 | 97ca439bc427b48bc1cd7177abe71365 | 5 | NaN | Perfeito, produto entregue antes do combinado. | ... | delivered | 2017-09-13 08:59:02 | 2017-09-13 09:45:35 | 2017-09-19 18:34:16 | 2017-09-20 23:43:48 | 2017-09-29 00:00:00 | 1 | campos dos goytacazes | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | credit_card | 3 | 259.83 | 7b07bacd811c4117b742569b04ce3580 | 4 | NaN | NaN | ... | delivered | 2017-04-26 10:53:06 | 2017-04-26 11:05:13 | 2017-05-04 14:35:00 | 2017-05-12 16:04:24 | 2017-05-15 00:00:00 | 1 | santa fe do sul | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | credit_card | 5 | 216.87 | 0c5b33dea94867d1ac402749e5438e8b | 5 | NaN | Chegou antes do prazo previsto e o produto sur... | ... | delivered | 2018-01-14 14:33:31 | 2018-01-14 14:48:30 | 2018-01-16 12:36:48 | 2018-01-22 13:19:16 | 2018-02-05 00:00:00 | 1 | para de minas | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | credit_card | 2 | 25.78 | f4028d019cb58564807486a6aaf33817 | 4 | NaN | NaN | ... | delivered | 2018-08-08 10:00:35 | 2018-08-08 10:10:18 | 2018-08-10 13:28:00 | 2018-08-14 13:32:39 | 2018-08-20 00:00:00 | 1 | atibaia | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | credit_card | 3 | 218.04 | 940144190dcba6351888cafa43f3a3a5 | 5 | NaN | Gostei pois veio no prazo determinado . | ... | delivered | 2017-02-04 13:57:51 | 2017-02-04 14:10:13 | 2017-02-16 09:46:09 | 2017-03-01 16:42:31 | 2017-03-17 00:00:00 | 1 | varzea paulista | SP | Sudeste |
5 rows × 23 columns
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_purchase_timestamp'] = \
pd.to_datetime(dtGeneralOrders['order_purchase_timestamp'], format = '%Y-%m-%d %H:%M:%S').dt.to_period('d')
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_delivered_customer_date'] = \
pd.to_datetime(dtGeneralOrders['order_delivered_customer_date'], format = '%Y-%m-%d %H:%M:%S').dt.to_period('d')
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_estimated_delivery_date'] = \
pd.to_datetime(dtGeneralOrders['order_estimated_delivery_date'], format = '%Y-%m-%d %H:%M:%S').dt.to_period('d')
dtGeneralOrders['diff_delivery_and_estimate'] = dtGeneralOrders['order_estimated_delivery_date'] -\
dtGeneralOrders['order_delivered_customer_date']
dtGeneralOrders['diff_delivery_and_purchase'] = dtGeneralOrders['order_purchase_timestamp'] -\
dtGeneralOrders['order_delivered_customer_date']
dtGeneralOrders['diff_delivery_and_estimate'] = \
dtGeneralOrders['diff_delivery_and_estimate'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
dtGeneralOrders['diff_delivery_and_purchase'] = \
dtGeneralOrders['diff_delivery_and_purchase'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
dtGeneralOrders['diff_delivery_and_estimate'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_estimate'])
dtGeneralOrders['diff_delivery_and_purchase'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_purchase'])
dtGeneralOrders.head()
| order_id | freight_value | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | ... | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | order_item_id | customer_city | customer_state | regiao | diff_delivery_and_estimate | diff_delivery_and_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | credit_card | 2 | 72.19 | 97ca439bc427b48bc1cd7177abe71365 | 5 | NaN | Perfeito, produto entregue antes do combinado. | ... | 2017-09-13 09:45:35 | 2017-09-19 18:34:16 | 2017-09-20 | 2017-09-29 | 1 | campos dos goytacazes | RJ | Sudeste | 9.0 | 7.0 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | credit_card | 3 | 259.83 | 7b07bacd811c4117b742569b04ce3580 | 4 | NaN | NaN | ... | 2017-04-26 11:05:13 | 2017-05-04 14:35:00 | 2017-05-12 | 2017-05-15 | 1 | santa fe do sul | SP | Sudeste | 3.0 | 16.0 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | credit_card | 5 | 216.87 | 0c5b33dea94867d1ac402749e5438e8b | 5 | NaN | Chegou antes do prazo previsto e o produto sur... | ... | 2018-01-14 14:48:30 | 2018-01-16 12:36:48 | 2018-01-22 | 2018-02-05 | 1 | para de minas | MG | Sudeste | 14.0 | 8.0 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | credit_card | 2 | 25.78 | f4028d019cb58564807486a6aaf33817 | 4 | NaN | NaN | ... | 2018-08-08 10:10:18 | 2018-08-10 13:28:00 | 2018-08-14 | 2018-08-20 | 1 | atibaia | SP | Sudeste | 6.0 | 6.0 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | credit_card | 3 | 218.04 | 940144190dcba6351888cafa43f3a3a5 | 5 | NaN | Gostei pois veio no prazo determinado . | ... | 2017-02-04 14:10:13 | 2017-02-16 09:46:09 | 2017-03-01 | 2017-03-17 | 1 | varzea paulista | SP | Sudeste | 16.0 | 25.0 |
5 rows × 25 columns
dtGeneralOrders = dtGeneralOrders[['order_id', 'freight_value', 'payment_type', 'payment_installments', 'payment_value',\
'review_score', 'order_status', 'order_item_id',\
'diff_delivery_and_estimate', 'diff_delivery_and_purchase',\
'customer_state', 'regiao']]
dtGeneralOrders = dtGeneralOrders[dtGeneralOrders['order_status'] == 'delivered']
dtGeneralOrders = dtGeneralOrders.drop('order_status', axis = 1)
dtGeneralOrders = dtGeneralOrders.rename(columns = {'order_item_id': 'quantity', 'regiao': 'customer_region'})
# Dropando valores NA
dtGeneralOrders = dtGeneralOrders.dropna()
dtGeneralOrders.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | credit_card | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | SP | Sudeste |
colunas_quantitativas = ['freight_value', 'payment_value', 'quantity',\
'diff_delivery_and_estimate', 'diff_delivery_and_purchase']
dtGeneralOrders.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | credit_card | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | SP | Sudeste |
dtGeneralOrdersProcessado = dtGeneralOrders.copy()
# Realizando LabelEncoder
labelencoderPaymentType = LabelEncoder()
labelencoderCustomerState = LabelEncoder()
labelencoderCustomerRegion = LabelEncoder()
dtGeneralOrdersProcessado['payment_type'] = labelencoderPaymentType.fit_transform(dtGeneralOrders['payment_type'])
dtGeneralOrdersProcessado['customer_state'] =\
labelencoderCustomerState.fit_transform(dtGeneralOrders['customer_state'])
dtGeneralOrdersProcessado['customer_region'] =\
labelencoderCustomerRegion.fit_transform(dtGeneralOrders['customer_region'])
dtGeneralOrdersProcessado.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | 18 | 3 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | 25 | 3 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | 10 | 3 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | 25 | 3 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | 25 | 3 |
scaler = StandardScaler()
#scaler = MinMaxScaler()
dtGeneralOrdersNormalizado = dtGeneralOrdersProcessado.copy()
dtGeneralOrdersNormalizado[colunas_quantitativas] = scaler.fit_transform(dtGeneralOrdersProcessado[colunas_quantitativas])
dtGeneralOrdersFinal = dtGeneralOrdersNormalizado.drop('order_id', axis = 1)
dtGeneralOrdersFinal.head()
| freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.426745 | 1 | 2 | -0.375092 | 5 | -0.354355 | -0.564103 | -0.567311 | 18 | 3 |
| 1 | -0.005297 | 1 | 3 | 0.329257 | 4 | -0.354355 | -1.298337 | 0.385421 | 25 | 3 |
| 2 | -0.136047 | 1 | 5 | 0.167997 | 5 | -0.354355 | 0.047760 | -0.461452 | 10 | 3 |
| 3 | -0.458481 | 1 | 2 | -0.549303 | 4 | -0.354355 | -0.931220 | -0.673170 | 25 | 3 |
| 4 | -0.118910 | 1 | 3 | 0.172389 | 5 | -0.354355 | 0.292504 | 1.338153 | 25 | 3 |
# Criando um modelo
modelo = KMeans(n_clusters = 8, random_state = seed_)
modelo.fit(dtGeneralOrdersFinal)
KMeans(random_state=194)
dtGeneralOrdersClusterizado = dtGeneralOrdersProcessado.copy()
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.drop('order_id', axis = 1)
dtGeneralOrdersClusterizado['cluster'] = modelo.labels_
dtGeneralOrdersClusterizado['payment_type'] =\
labelencoderPaymentType.inverse_transform(dtGeneralOrdersClusterizado['payment_type'])
dtGeneralOrdersClusterizado['customer_state'] =\
labelencoderCustomerState.inverse_transform(dtGeneralOrdersClusterizado['customer_state'])
dtGeneralOrdersClusterizado['customer_region'] =\
labelencoderCustomerRegion.inverse_transform(dtGeneralOrdersClusterizado['customer_region'])
metodos_pagamento = {
'credit_card': 'Credito',
'debit_card': 'Debito'
}
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.replace({'payment_type': metodos_pagamento})
dtGeneralOrdersClusterizado.head()
| freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.29 | Credito | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | RJ | Sudeste | 2 |
| 1 | 19.93 | Credito | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | SP | Sudeste | 1 |
| 2 | 17.87 | Credito | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | MG | Sudeste | 0 |
| 3 | 12.79 | Credito | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | SP | Sudeste | 1 |
| 4 | 18.14 | Credito | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | SP | Sudeste | 1 |
dtClusters = pd.DataFrame()
dtClusters['Preco_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_value'].mean(), 2)
dtClusters['Frete_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['freight_value'].mean(), 2)
dtClusters['Media_Pagamentos'] =\
round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_installments'].mean(), 2)
dtClusters['Score_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['review_score'].mean(), 2)
dtClusters['Itens_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['quantity'].mean(), 2)
dtClusters['Dif_Estimado_Entregue'] =\
round(dtGeneralOrdersClusterizado.groupby('cluster')['diff_delivery_and_estimate'].mean(), 2)
dtClusters['Dif_Compra_Entregue'] =\
round(dtGeneralOrdersClusterizado.groupby('cluster')['diff_delivery_and_purchase'].mean(), 2)
dtClusters['Pagamento_Prevalente'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['payment_type'].agg(lambda x: x.value_counts().index[0])
dtClusters['Estado_Prevalente'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['customer_state'].agg(lambda x: x.value_counts().index[0])
dtClusters['Regiao_Prevalente'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['customer_region'].agg(lambda x: x.value_counts().index[0])
display(HTML("<style>.container { width:100% !important; }</style>"))
dtClusters.to_csv('analise_clusters.csv')
dtClustersIndexado = dtClusters.reset_index()
dtClustersIndexado = dtClustersIndexado.rename(columns = {'cluster': 'Cluster'})
dtClustersIndexado['Cluster'] = dtClustersIndexado['Cluster'].apply(lambda x: 'Cluster ' + str(x))
headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig = go.Figure(data=[go.Table(
header=dict(
values=list(dtClustersIndexado.columns),
line_color='darkslategray',
fill_color=headerColor,
align=['left','center'],
font=dict(color='white', size=12)
),
cells=dict(
values=[dtClustersIndexado.Cluster, dtClustersIndexado.Preco_Medio, dtClustersIndexado.Frete_Medio, dtClustersIndexado.Media_Pagamentos,\
dtClustersIndexado.Score_Medio, dtClustersIndexado.Itens_Medio, dtClustersIndexado.Dif_Estimado_Entregue,\
dtClustersIndexado.Dif_Compra_Entregue, dtClustersIndexado.Pagamento_Prevalente, dtClustersIndexado.Estado_Prevalente,\
dtClustersIndexado.Regiao_Prevalente],
line_color='darkslategray',
fill_color = [[rowOddColor,rowEvenColor,rowOddColor, rowEvenColor]*10],
align = ['left', 'center'],
font = dict(color = 'darkslategray', size = 11)
))
])
fig.show()